Project. Finding inefficient opertors in Telecom.

Additional stuff: Presentation Dashboard

Task.¶

Our customer is a platform «Callme», which supplies different services regarding VoIP (software, internet call channels, hardware etc.), the ultimate clients are the callcenters. Callcenters hire their staff - operators - to make outgoing calls and receive incoming calls from abonents (private individuals, companies etc - all depending on the business demand and circumstances). Each callceneter has its own incoming/outcoming calls' ratio according to its situation and requirements.

The customer has a need to learn how to isolate inefficient operators using special algorithms, which in the future can be sold to customers (callcenters) as a new service. We are required to analyze the available data on calls and develop such an algorithm.

Signs of low efficiency of operators that the customer suggests to take into account:

  • a lot of missed incoming calls,
  • long waiting for an answer for incoming external calls,
  • few outgoing calls — in the case of those call centers that specialize in outgoing calls and almost don't process incoming calls.

Description of the data.¶

Datasets describe the use of the services of "Callme", a virtual telephony provider. Its clients are the callcenters that:

  • distribute incoming calls to operators,
  • make outgoing calls by operators.

Also, operators can make internal calls — calls between each other inside the virtual telephony network.

The columns in telecom_dataset.csv.zip:

  • `user_id' — ID of the customer account of the callcenter in the service
  • `date' — Date of statistics
  • `direction' — Direction of calls (out - outgoing call, in — incoming call)
  • internal — Is the call an internal call between the client's operators
  • `operator_id' — Operator ID
  • is_missed_call — Is the call missed
  • `calls_count' — Number of calls
  • `call_duration' — Duration of the call (excluding waiting time)
  • total_call_duration — Duration of the call (taking into account the waiting time)

The columns in telecom_clients.csv:

  • `user_id' — ID of the client (callcenter) account in the service
  • tariff_plan — The current tariff plan of the client
  • `date_start' — Date of registration of the client in the service.

Research plan.¶

  1. Uploading and checking data for duplicates, omissions.
  2. Elimination of data errors, filling in gaps, output of the main parameters of the 1st dataset.
  3. Compilation of the algorithm of the program (decomposition of the problem).
  4. Application of the algorithm for multiple tasks/metrics.
  5. Formulation and verification of statistical hypotheses regarding the variability of numerical parameters at different time intervals of the dataset.
  6. Conclusions, presentation, dashboard.

Reading, primary analysis and data processing.¶

Data reading.¶

In [1]:
# importing the libraries needed for work

import pandas as pd
from scipy import stats as st
import datetime as dt
import numpy as np
import math as mth
from matplotlib import pyplot as plt 
import seaborn as sns
import os
import plotly.express as px
In [2]:
def file_path(name):
    """
File search function (lowercase file name is the function argument) in Linux and Windows folders 
 (if the notebook works locally).

We read the data file with an attempt to take data locally: 4 attempts for the Yandex-server and 
Windows. The file should be located next to the notebook or in the nested directory "datasets". 
    """
    for p in ('/datasets', '', os.path.abspath(""), os.path.abspath("")+'\\datasets\\'):
        pth1 = os.path.join(p, name)
        if os.path.exists(pth1):
            break # if the search is successful, we break the cycle ahead of schedule
    else:
        pth1 = ''
        print(f"The {name} file wasn't found. Make sure that it", 
              'is present in the directory next to the notebook or in a subfolder "/datasets"')
    print('Data reading from:', pth1, '\n')
    return pth1
In [3]:
name = 'telecom_dataset.csv' # the name of the file to be found and read   
df = pd.read_csv(file_path(name))  # writing the data file to the dataframe 
df.head(3) # let's check the outlook of data
Data reading from: telecom_dataset.csv 

Out[3]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
0 166377 2019-08-04 00:00:00+03:00 in False NaN True 2 0 4
1 166377 2019-08-05 00:00:00+03:00 out True 880022.0 True 3 0 5
2 166377 2019-08-05 00:00:00+03:00 out True 880020.0 True 1 0 1
In [4]:
name = 'telecom_clients.csv' # the name of the file to be found and read      
df2 = pd.read_csv(file_path(name))  # writing the data file to the dataframe 
df2.head(3) # let's check the outlook of data
Data reading from: telecom_clients.csv 

Out[4]:
user_id tariff_plan date_start
0 166713 A 2019-08-15
1 166901 A 2019-08-23
2 168527 A 2019-10-29

Finding and handling errors in the first dataframe - df.¶

In [5]:
len_df_ini = len(df) # rows in the dataframe before processing begins (we'll use it to control "losses")
len_df_ini
Out[5]:
53902
In [6]:
def lost_data(df):
    """
    The function of checking the proportion of losses in the dataframe columns
    """
    out = pd.DataFrame(columns = ['column_names','omissions','lost_portion'])
    i = 0
    for column in df.columns:
        if df[column].isna().sum() > 0:
            out.loc[i] = [column, df[column].isna().sum(), round(df[column].isna().sum()/len(df), 4)]
            i += 1
    print(f"\nLost data in the columns: columns with omissions: {i} from {len(df.columns)}.")
    out.set_index('column_names', inplace=True)
    return out

df.drop_duplicates(inplace=True) # remove the explicit duplicates and
print(df.info()) # let's see what's with the losses in the first dataframe
print(lost_data(df).sort_values('lost_portion'))
print('\n---After removing explicit duplicates in the dataframe df',round((len_df_ini-len(df))*100/len(df),2),'% of rows are deleted.')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 49002 entries, 0 to 53901
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   user_id              49002 non-null  int64  
 1   date                 49002 non-null  object 
 2   direction            49002 non-null  object 
 3   internal             48892 non-null  object 
 4   operator_id          41546 non-null  float64
 5   is_missed_call       49002 non-null  bool   
 6   calls_count          49002 non-null  int64  
 7   call_duration        49002 non-null  int64  
 8   total_call_duration  49002 non-null  int64  
dtypes: bool(1), float64(1), int64(4), object(3)
memory usage: 3.4+ MB
None

Lost data in the columns: columns with omissions: 2 from 9.
              omissions  lost_portion
column_names                         
internal            110        0.0022
operator_id        7456        0.1522

---After removing explicit duplicates in the dataframe df 10.0 % of rows are deleted.

What can be seen from the first dataframe:

  • column names are ok
  • operator_id is of type float64, all IDs end with ".0", let's leave it - maybe this is the client's style
  • in date, it needs to be converted to the date type
  • in internal, it needs to remove the gaps (there are few of them, about 2%)
  • internal needs to be converted to the boolean type (not before deleting the gaps)
  • operator_id has 15% omissions, how to process that? We will solve this issue after the previous ones.
In [7]:
# let's check, are there any implicit duplicates or errors in 3 columns
for col in ('direction','internal','is_missed_call'):
    print(f'Unique values in the column {col} -', df[col].unique())
Unique values in the column direction - ['in' 'out']
Unique values in the column internal - [False True nan]
Unique values in the column is_missed_call - [ True False]
In [8]:
df.dropna(subset=['internal'], inplace=True) # removing the gaps
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d') # changing type
df['internal'] = df['internal'].astype('bool') # changing type
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48892 entries, 0 to 53901
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype                                
---  ------               --------------  -----                                
 0   user_id              48892 non-null  int64                                
 1   date                 48892 non-null  datetime64[ns, pytz.FixedOffset(180)]
 2   direction            48892 non-null  object                               
 3   internal             48892 non-null  bool                                 
 4   operator_id          41491 non-null  float64                              
 5   is_missed_call       48892 non-null  bool                                 
 6   calls_count          48892 non-null  int64                                
 7   call_duration        48892 non-null  int64                                
 8   total_call_duration  48892 non-null  int64                                
dtypes: bool(2), datetime64[ns, pytz.FixedOffset(180)](1), float64(1), int64(4), object(1)
memory usage: 3.1+ MB
  • 3 issues have been solved: types have been changed, omissions in internal have been removed.
In [9]:
print('---Now in the dataframe df', round((len_df_ini - len(df)) * 100 /len(df), 2), '% of rows are deleted.')
---Now in the dataframe df 10.25 % of rows are deleted.

Let's move on to the question: operator_id has 15% of gaps (7456), how to handle that?

In [10]:
print('Number of zero-duration calls with "lost" operator_id (non-call):')
df[(df.operator_id.isna()) & (df.call_duration == 0)]['user_id'].count()
Number of zero-duration calls with "lost" operator_id (non-call):
Out[10]:
7288

We will replace these calls in the operator ID with NaN for an ID = 100000.0 (this will be ID "operator of lost calls without an operator")

In [11]:
100000.0 in df.operator_id # на всякий случай убедимся, что ID 100000.0 пока свободен
Out[11]:
False
In [12]:
# assign an artificial operator ID 100000.0 where operator =  NaN and call length = 0
df['operator_id'] = df['operator_id'].where((df.operator_id.notna()) | (df.call_duration != 0), other=100000.0)

Let's check non-zero duration calls with the sign is_missed_call:

In [13]:
print('---Now in the dataframe df', round((len_df_ini - len(df)) * 100 /len(df), 2), '% of rows are deleted.')
---Now in the dataframe df 10.25 % of rows are deleted.
In [14]:
print('Calls with property is_missed_call of non-zero length:')
df[(df.is_missed_call == True) & (df.call_duration != 0)].head()
Calls with property is_missed_call of non-zero length:
Out[14]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
1606 166405 2019-11-19 00:00:00+03:00 in False 939478.0 True 1 165 173
1635 166405 2019-11-21 00:00:00+03:00 in False 882686.0 True 1 1 5
1657 166405 2019-11-22 00:00:00+03:00 in False 882686.0 True 1 1 5
2283 166407 2019-09-23 00:00:00+03:00 in False 888534.0 True 1 133 156
2869 166482 2019-11-05 00:00:00+03:00 in False 934074.0 True 1 1 15

Since we do not have an explanation of how such calls can have a missed sign (1 second is also not missed), we will change this sign to the opposite for all calls lasting more than 0 (most likely seconds, but there is no data from the customer).

In [15]:
df['is_missed_call'] = df['is_missed_call'].where((df.is_missed_call==False ) | (df.call_duration == 0), other=False)
In [16]:
df[(df.is_missed_call == True) & (df.call_duration != 0)].head() # calls from one second are no longer listed as missed
Out[16]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration

Let's see how many calls there are now without an operator ID:

In [17]:
print('---Now in the dataframe df', round((len_df_ini - len(df)) * 100 /len(df), 2), '% of rows are deleted.')
---Now in the dataframe df 10.25 % of rows are deleted.
In [18]:
df[(df.operator_id.isna())]['user_id'].count()
Out[18]:
113

Very few, so let's delete them.

In [19]:
# deleting erroneous rows by indexes
df = df.drop(df[(df.operator_id.isna())].index)
In [20]:
print(lost_data(df).sort_values('lost_portion'))
Lost data in the columns: columns with omissions: 0 from 9.
Empty DataFrame
Columns: [omissions, lost_portion]
Index: []

Let's look at the average full length of the call, the highs fit into 1 hour, we consider as normal:

In [21]:
(df['total_call_duration']/df['calls_count']).sort_values().tail(5)
Out[21]:
31087    2329.0
8986     2618.0
34461    2904.0
8370     2907.5
14787    3609.0
dtype: float64

Let's look at the full length of the call - the highs give about 46 hours, it looks like the counter is stuck on weekends (the theoretical maximum is 48 hours). We will allow a maximum time of 9 hours (a working day with a lunch break), we will remove everything superfluous from the database.

In [22]:
df.total_call_duration.sort_values().tail(3) # maximums up to 46 hours
Out[22]:
6808    141930
6950    146903
6946    166155
Name: total_call_duration, dtype: int64
In [23]:
# deleting erroneous rows by indexes
df = df.drop(df[(df.total_call_duration > (9*60*60))].index)
In [24]:
df.total_call_duration.sort_values().tail(3) # checking the result
Out[24]:
30184    30864
7113     32148
6790     32291
Name: total_call_duration, dtype: int64
In [25]:
print('---Now in the dataframe df', round((len_df_ini - len(df)) * 100 /len(df), 2), '% of rows are deleted.')
---Now in the dataframe df 10.81 % of rows are deleted.

Conclusion to 1.2:

  • Thus, the data are transformed into the necessary types, duplicates are removed, gaps are either filled in with "patches" or deleted.
  • The "losses" in the data amounted to 10.81%, and 10% are obvious duplicates, eliminated at the very beginning.

Reference data for the 1st dataset.¶

We will form histograms by dataframe columns, not including the direction, internal and is_missed_call fields. The graphs are built in Plotly, they are interactive and can be scaled (the zoom function is very useful here).

In [26]:
fig = px.histogram(df, x="user_id")
fig.update_layout(title='Histogram of call center IDs',
                  width=600, # specifying the size of the graph
                  height=300,
                 )
fig.show() # printing the graph
In [27]:
fig = px.histogram(df, x="date")
fig.update_layout(title='Histogram by time',
                  width=600, # specifying the size of the graph
                  height=300,
                 )
fig.show() # printing the graph 
In [28]:
fig = px.histogram(df, x="operator_id")
fig.update_layout(title="Histogram for operators' ID (fiction 100000.0 is in the left.)",
                  width=600, # specifying the size of the graph
                  height=300,
                 )
fig.show() # printing the graph 
In [29]:
fig = px.histogram(df, x="calls_count")
fig.update_layout(title='Histogram of the number of calls',
                  width=600, # specifying the size of the graph
                  height=300,
                 )
fig.show() # printing the graph
In [30]:
fig = px.histogram(df, x="call_duration")
fig.update_layout(title='Histogram of the duration of calls (cumulative)',
                  width=600, # specifying the size of the graph
                  height=300,
                 )
fig.show() # printing the graph 
In [31]:
fig = px.histogram(df, x="total_call_duration")
fig.update_layout(title='Histogram of the total duration of calls (cumulative)',
                  width=600, # specifying the size of the graph
                  height=300,
                 )
fig.show() # printing the graph 
  • Histograms do not provide particularly valuable information for most columns, except for time. If we ignore the failures on weekends, daily activity (i.e. the number of entries) continues to increase, which means that the business is growing and developing.
In [32]:
print('Items in dataset.')
len(df)
Items in dataset.
Out[32]:
48645
In [33]:
print('Callcenters in dataset.')
len(set(df.user_id))
Callcenters in dataset.
Out[33]:
307
In [34]:
print('Callcenters IDs in dataset.')
df.user_id.value_counts() # let's take a look at how many times different operator IDs occur
Callcenters IDs in dataset.
Out[34]:
168187    2342
166658    1882
167497    1281
166916    1200
167626    1167
          ... 
168578       1
168459       1
166554       1
166584       1
167981       1
Name: user_id, Length: 307, dtype: int64
In [35]:
print('Number of operators in the dataset (including fiction 100000.0).')
len(set(df.operator_id))
Number of operators in the dataset (including fiction 100000.0).
Out[35]:
1093
In [36]:
print('Operator IDs in the dataset.')
df.operator_id.value_counts() # let's take a look at how many times different operator IDs occur
Operator IDs in the dataset.
Out[36]:
100000.0    7286
901884.0     323
901880.0     300
893804.0     289
887276.0     267
            ... 
960666.0       1
899906.0       1
899912.0       1
883018.0       1
937770.0       1
Name: operator_id, Length: 1093, dtype: int64
In [37]:
rr = df.groupby('date', as_index=False)['operator_id'].nunique()
fig = px.line(rr, x='date',y='operator_id')
fig.update_layout(title='Number of calling operators per day',
                  width=600, # specifying the size of the graph
                  height=400,
                  xaxis_title='Time',
                  yaxis_title='Quantity'
                 )
fig.show() # printing the graph 
  • the number of operators is growing, we will additionally see if the number of operators is growing in terms of 1 call center.
In [38]:
op = df.groupby('date', as_index=False)['operator_id'].nunique() # operators quantities
cc = df.groupby('date', as_index=False)['user_id'].nunique() # callcenters quantities 
len(op) == len(cc) # are the lengths equal, aren't there gaps in the data?
Out[38]:
True
In [39]:
# this column stores the sums of operators divided by number of callcenters
op['op_p_user'] = op['operator_id'] / cc['user_id'] 
In [40]:
fig = px.line(op, x='date',y='op_p_user')
fig.update_layout(title='The number of operators per 1 callcenter',
                  width=600, # specifying the size of the graph
                  height=400,
                  xaxis_title='Time',
                  yaxis_title='Quantity'
                 )
fig.show() # printing the graph
  • It will be shown below that there are call centers where up to 50 operators work. Nevertheless, on average 1-2.5 people work (make calls) in a call center, over time the number of operators per callcenter is growing.
In [41]:
rr = df.groupby('date', as_index=False)['calls_count'].sum()
fig = px.line(rr, x='date',y='calls_count')
fig.update_layout(title='Calls per day',
                  width=600, # specifying the size of the graph
                  height=400,
                  xaxis_title='Time',
                  yaxis_title='Quantity'
                 )
fig.show() # printing the graph 
  • the number of daily calls in the first two months increases, in the 3rd and 4th months it is difficult to assess the presence of growth due to significant weekly fluctuations.
In [42]:
rr = df.groupby('date', as_index=False)['user_id'].nunique()
fig = px.line(rr, x='date',y='user_id')
fig.update_layout(title='Active callcenters per day',
                  width=600, # specifying the size of the graph
                  height=400,
                  xaxis_title='Time',
                  yaxis_title='Queantity'
                 )
fig.show() # printing the graph 
  • obviously, the number of clients (callcenters) has been increasing for 4 months and seems to be stabilizing, while "drops" on weekends are noticeable.

The growing business brings certain difficulties in accounting for both callcenters and operators. The number of callcenters gradually increases in the first half of the period. It is possible that there are some customers who have stopped cooperating with "Callme" during the given period. The solution suggests itself - to cut off 1/3 or half of the period.

In regard of callcenters. Analytics in the 3rd section works on grouping data for each call center separately, time data is taken from the filtered xx.date, and not from the original df.date, so the beginning of the period works correctly. But on the right border of the time interval, the code has been corrected, now it will check whether the client's activity has been in the last d_count days (a test for the client's relevance). For more information, see the function for selecting inefficient operators.

In regard of operators. We don't do a relevance test (operators go on vacation or return back, it's difficult). If the number of operators in the callcenter fluctuates, we do nothing. If the business is dying and the number of operators is reducing - we do nothing. If there are 2 times more operators in the second half of the time period than in the first half, there is rapid growth in the callcenter, together with a lot of changes, so we cut the time period by its older half. For more information, see the function for selecting inefficient operators. (Across the entire dataframe, the ratio is 461 to 997)

In [43]:
print('The beginning and end of the time interval in the data.')
df.date.min(), df.date.max()
The beginning and end of the time interval in the data.
Out[43]:
(Timestamp('2019-08-02 00:00:00+0300', tz='pytz.FixedOffset(180)'),
 Timestamp('2019-11-28 00:00:00+0300', tz='pytz.FixedOffset(180)'))
In [44]:
print('The numbers of external and internal calls:')
df.internal.value_counts() 
The numbers of external and internal calls:
Out[44]:
False    43005
True      5640
Name: internal, dtype: int64
In [45]:
print('The numbers of outgoing and incoming calls:')
df.direction.value_counts()
The numbers of outgoing and incoming calls:
Out[45]:
out    28821
in     19824
Name: direction, dtype: int64
In [46]:
print('The numbers of calls (accumulated cumulatively).')
df.calls_count.value_counts()
The numbers of calls (accumulated cumulatively).
Out[46]:
1      12933
2       6366
3       4047
4       3028
5       2224
       ...  
380        1
514        1
540        1
645        1
748        1
Name: calls_count, Length: 415, dtype: int64
In [47]:
print('Missed calls (number of rows).')
df.is_missed_call.sum()
Missed calls (number of rows).
Out[47]:
21097
  • Some statistics are collected here, which gives a general idea of the data and may be needed further in the analysis and calculations.

Finding and handling errors in the second dataframe - df2.¶

In [48]:
df2.drop_duplicates(inplace=True) # remove the explicit duplicates and
print(df2.info()) # let's see what's with the losses in the second dataframe
print(lost_data(df2).sort_values('lost_portion'))
<class 'pandas.core.frame.DataFrame'>
Int64Index: 732 entries, 0 to 731
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      732 non-null    int64 
 1   tariff_plan  732 non-null    object
 2   date_start   732 non-null    object
dtypes: int64(1), object(2)
memory usage: 22.9+ KB
None

Lost data in the columns: columns with omissions: 0 from 3.
Empty DataFrame
Columns: [omissions, lost_portion]
Index: []
  • column names are ok
  • there are no omissions
  • in the second dataframe, we only need to correct the data type in the date_start field.
In [49]:
df2['date_start'] = pd.to_datetime(df2['date_start'], format='%Y-%m-%d') # changing type
df2.info() # checking the result
<class 'pandas.core.frame.DataFrame'>
Int64Index: 732 entries, 0 to 731
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   user_id      732 non-null    int64         
 1   tariff_plan  732 non-null    object        
 2   date_start   732 non-null    datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 22.9+ KB
In [50]:
print('For reference. Distribution of tariffs.')
df2.tariff_plan.value_counts()
For reference. Distribution of tariffs.
Out[50]:
C    395
B    261
A     76
Name: tariff_plan, dtype: int64

Conclusion to 1.3:

  • Thus, the data were brought to the right types, there are no duplicates.
  • The second dataframe is not required, we'll use it below only if necessary.

Decomposition of the problem. Definition of the processing and filtering algorithm.¶

Who could possibly need it?¶

The project may be of interest to those callcenters where a large number of operators work and it is impossible to carry out quality control over each of them. Let's see how the number of employees is distributed across callcenters (see section 1.3 for a graph of average values).

In [51]:
yy = df.groupby('user_id', as_index=False)['operator_id'].nunique().sort_values(by='operator_id')['operator_id']
yy.plot(kind='hist', figsize=(5, 4), bins=50)
plt.xticks(rotation = 0)
plt.title('Number of operators by callcenters')
plt.ylabel('Callcenters number')
plt.xlabel('Operators number')
plt.show();
In [52]:
n = 5 # after this number, we believe that there are quite a lot of operators working in the callcenter
print(f'Portion of callcenters, where operators number > {n}, {round(100 * yy.where(yy>n).notna().sum()/len(yy))}%')
Portion of callcenters, where operators number > 5, 20%
  • We see that the clients are small companies, yet up to 20% of them may be somehow interested in automatic control. Let's move on.

Definition of the processing and filtering algorithm.¶

Let's look at how the information is collected in the dataframe provided by the client. Filter by one of the dates and operator ID.

In [53]:
df[(df.date == '2019-10-28') & (df.operator_id == 948182.0)]
Out[53]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
53451 168416 2019-10-28 00:00:00+03:00 out False 948182.0 False 12 539 846
53452 168416 2019-10-28 00:00:00+03:00 in False 948182.0 False 1 97 104
53454 168416 2019-10-28 00:00:00+03:00 out False 948182.0 True 5 0 98

As we can see, the data is already aggregated: we have the number of calls (calls_count) and their total duration (call_duration, total_call_duration) for each combination of date, direction, internal, operator_id. The request to the customer is to provide non-aggregated data, because they can be processed in a large number of ways.

Talking in "tableau" terminology, we have measurement columns - date, direction, internal, operator_id, is_missed_call, it will be possible to filter and select the data by them. And according to the columns of measures - calls_count, call_duration, total_call_duration - numerical metrics can be considered.

Introductory remarks.

What is the "efficient" and "inefficient" work of operators inside their callcenters? Callcenters are different, their business tasks are different, in addition, over time, their profile (duration of calls, ratio of incoming /outgoing calls, etc.) may change. Therefore, attempts to put metrics taken from somewhere outside for each of the callcenters may be successful, or may not be.

Let's take as an axiom (an axiom is something that does not require a proof) the statement that most operators want, strive and work well (i.e. efficiently). Then the sample (target) for each callcenter within a certain time interval will be its average indicators (metrics), since they are typical for most operators. "Anomalies" can be regarded as a sign of inefficient work: for example, the operator may talk too little or too long with the clients, and both will be regarded as a "deviation from the effective average". The final decision of course belongs to the client, since our analysis only highlights the deviations.

The operation of the algorithm.

There can be three situations: the studied criterion (research column) has inefficient operators in the left, in the area of small values, in the right – in the area of large values, or on both sides beyond the boundaries of the segment in the center of which lies the average, and the boundaries are plus or minus the standard deviation multiplied by the global multiplier k (for example 1.0). Signs (levels) of anomaly search in the left and right are set by the parameters l and r.

If the anomalies are on both sides, then the targeted (desired, ideal) average is calculated over the entire time range. But for those searches only on the left or right, the algorithm is more complex. We have to find the best indicator in the past and set it as a model (target), relatively to which the indicators of efficiency or inefficiency are determined.

If the anomalies are only on the left, then the time is divided into segments by d_count days (this is a global parameter, set for all tests), and at each time interval its average (srd) is calculated, and then the largest one is selected from these averages, which becomes a target. Relative to this targeted average, k standard deviations are set to the left (the deviation is calculated over the entire time in the dataframe) and all operators with indicators left from this boundary are marked as anomalous. For anomalies located only in the right, the process is mirrored.

About calculated metrics.

We are able to choose different metrics: primary (by dataframe fields) and secondary (their mathematical combinations). The more columns in the data and the fewer aggregations, the more metrics can be used. To begin with (see the next section), we will construct those three that the client offered us and two more for analyzing the time of incoming and outgoing calls.

Which metrics can be considered (the list is not complete, because one can average, search for maximums and minimums, divide, etc.):

  • missed calls for each operator
  • average waiting time for each operator
  • outgoing calls for each operator
  • talk time for each operator (incoming calls)
  • talk time for each operator (outgoing calls)
  • the ratio of missed calls to accepted calls for each operator
  • the ratio of missed calls to accepted calls for the callcenter
  • average talk time for the callcenter.

Identifying of inefficient operators.¶

In [54]:
# variables for all studies
k = 1.0       # good operators include those who are from the average +- k standard deviations
d_count = 10  # how many days are in each interval into which we divide the dataframe
In [55]:
def report(dframe): # report generation function
    c = dframe.callcenter_id.nunique() 
    o = dframe.operator_id.nunique()
    print(f'\n{dframe.name}.\nFound {c} callcenters, having {o} "anoumalous" operators.')
In [56]:
def find_uneff(l, r, dlt, xx): # the function for selecting inefficient operators
    """
l - boolean is a key telling that we are looking for inefficient operators in the left in small values
r - boolean is a key telling that we are looking for inefficient operators in the right in large values 
dlt (k*np.std) - the run-up plus or minus from the average, beyond this segment the operators are abnormal
xx - the raw dataframe on which the search is going on, the callcenter is fixed when the function is called
    """
# if the callcenter worked d_count days ago, we count it as active, else return []
    if df.date.max()-xx.date.max() < dt.timedelta(days=d_count):

# if there are 2 times more operators in the second half of the time interval than in the first - 
# there is rapid growth in the callcenter, a lot of changes, so we cut out the left part of the period         
        half = (xx.date.max() - xx.date.min()) / 2
        if (xx[xx.date >= xx.date.min()+half]['operator_id'].nunique() >
        2 * xx[xx.date < xx.date.min()+half]['operator_id'].nunique() ):
            xx = xx[xx.date >= xx.date.min()+half] # have cut out the left part of the period
        
        if l & r:
            srd = np.mean(xx.research) # aggregation might be not mean (e.g. sum), it should be everywhere the same 
            xx = xx[['date','operator_id','research']].groupby('operator_id',as_index=False)['research'].mean()
            gr = [xx.loc[i]['operator_id']  for i in range(len(xx)) if xx.loc[i]['research'] > srd+dlt] 
            gl = [xx.loc[i]['operator_id']  for i in range(len(xx)) if xx.loc[i]['research'] < srd-dlt] 
            return gr+gl

        elif l: # looking for the average as a maximum over all the time intervals
# in fact xx.date.min() - is the start date of the operator by which the data is filtered when calling the function
            ld = xx.date.min() 
            rd = xx.date.min() + dt.timedelta(days=d_count)
            srd = -1000000000 # just a very small number
            while rd < xx.date.max():
                srd = max(srd,xx.loc[(xx.date >= ld) & (xx.date < rd)]['research'].mean())
                ld = rd
                rd += dt.timedelta(days=d_count)
            xx = xx[['date','operator_id','research']].groupby('operator_id',as_index=False)['research'].mean()
            gl = [xx.loc[i]['operator_id']  for i in range(len(xx)) if xx.loc[i]['research'] < srd-dlt]
            return gl

        elif r: # looking for the average as a minimum over all the time intervals
            ld = xx.date.min()
            rd = xx.date.min() + dt.timedelta(days=d_count)
            srd = 1000000000 # just a very large number
            while rd < xx.date.max():
                srd = min(srd,xx.loc[(xx.date >= ld) & (xx.date < rd)]['research'].mean())
                ld = rd
                rd += dt.timedelta(days=d_count)
            xx = xx[['date','operator_id','research']].groupby('operator_id',as_index=False)['research'].mean()
            gr = [xx.loc[i]['operator_id'] for i in range(len(xx)) if xx.loc[i]['research'] > srd+dlt] 
            return gr
        else:
            return []
  • in the commercial version, we'd need to return not just the operators IDs, but along with them their research parameter, the average for the callcenter, the number of the time interval in which this record average was calculated, and the key of more / less / normal.

1. Symptom: a lot of missed incoming calls.¶

For each callcenter (user_id), we build our own distribution, filtering by direction (in), internal (False), is_missed_call (True), ranking operators by the number of calls (calls_count).

In [57]:
name = 'Research 1: a lot of missed incoming calls' ###### change each time
    
# collect anomalies in the left: l=True, collect anomalies in the right: r=True
l = False ###### collect inefficient ones from below, change each time
r = True  ###### collect inefficient ones from above, change each time

research = pd.DataFrame([],columns=['callcenter_id','operator_id']) # temporary for results

for callcenter in df.user_id.unique():
    option = ((df.direction=='in') ###### change each time
            & (df.is_missed_call==True)  ###### change each time
            & (df.internal==False)  
            & (df.operator_id!=100000.0)
            & (df.user_id==callcenter) 
              )
    xx=df[option].groupby(['date', 'operator_id'],as_index=False)[['calls_count', 'call_duration', 'total_call_duration']].sum()
    xx['research'] = xx['calls_count'] ###### making a research column
    
    dlt = k*np.std(xx.research) # deviation is calculated over the entire time interval from the dataframe
    
    anomalies = find_uneff(l, r, dlt, xx) # getting a list of abnormal operator IDs
    
    if anomalies != None:
        for i in range(len(anomalies)):
            research.loc[-1] = [callcenter, anomalies[i]]  # add a row
            research.index = research.index + 1  # shift the index
            research = research.sort_index()  # sort by index

research.callcenter_id = research.callcenter_id.astype(int)
print('Callcenters, anomalies and operator IDs with their "abnormal" results.')

research_01 = research.copy() ###### creating a copy of the dataframe with the results
print('\n', research_01) 
research_01.name = name ###### assign a name to the dataframe, we might add other service information
report(research_01) ###### print out the short result
Callcenters, anomalies and operator IDs with their "abnormal" results.

    callcenter_id  operator_id
0         168187     948286.0
1         168187     937956.0
2         167497     924936.0
3         167466     921814.0
4         166916     906410.0
5         166541     908960.0
6         166405     882686.0

Research 1: a lot of missed incoming calls.
Found 6 callcenters, having 7 "anoumalous" operators.
  • As we can see, there are very few missed calls, partly this is due to the fact that many calls are not attached to some operator IDs, in our database such lost calls are attributed to the fictitious 100000.0 id. This fact is demonstrated in two cells below.
In [58]:
print('Missed calls with opertator IDs not 100000.0:')
len(df[(df.is_missed_call==True)&(df.operator_id!=100000.0)])
Missed calls with opertator IDs not 100000.0:
Out[58]:
13811
In [59]:
print('Missed calls with operator IDs 100000.0 - lost:')
len(df[(df.is_missed_call==True)&(df.operator_id==100000.0)])
Missed calls with operator IDs 100000.0 - lost:
Out[59]:
7286

2. Symptom: long waiting for a response on incoming external calls.¶

For each callcenter (user_id), we build our own distribution, filtering by direction (in), internal (False), ranking operators by the difference in the amounts of total_call_duration and call_duration divided by the number of calls of calls_count (i.e. by the average waiting time).

In [60]:
name = 'Research 2: long waiting for a response on incoming external calls' ###### change each time
    
# collect anomalies in the left: l=True, collect anomalies in the right: r=True
l = False ###### collect inefficient ones from below, change each time
r = True  ###### collect inefficient ones from above, change each time

research = pd.DataFrame([],columns=['callcenter_id','operator_id']) # temporary for results

for callcenter in df.user_id.unique():
    option = ((df.direction=='in') ###### change each time
            & (df.is_missed_call==False)  ###### change each time
            & (df.internal==False)  
            & (df.operator_id!=100000.0)
            & (df.user_id==callcenter) 
              )
    xx=df[option].groupby(['date', 'operator_id'],as_index=False)[['calls_count', 'call_duration', 'total_call_duration']].sum()
    xx['research'] = round((xx['total_call_duration']-xx['call_duration'])/xx['calls_count']) ###### making a research column
    
    dlt = k*np.std(xx.research) # deviation is calculated over the entire time interval from the dataframe
    
    anomalies = find_uneff(l, r, dlt, xx) # getting a list of abnormal operator IDs
    
    if anomalies != None:
        for i in range(len(anomalies)):
            research.loc[-1] = [callcenter, anomalies[i]]  # add a row
            research.index = research.index + 1  # shift the index
            research = research.sort_index()  # sort by index

research.callcenter_id = research.callcenter_id.astype(int)
print('Callcenters, anomalies and operator IDs with their "abnormal" results.')

research_02 = research.copy() ###### creating a copy of the dataframe with the results
print('\n', research_02) 
research_02.name = name ###### assign a name to the dataframe
report(research_02) ###### print out the short result
Callcenters, anomalies and operator IDs with their "abnormal" results.

      callcenter_id  operator_id
0           168601     952916.0
1           168579     973286.0
2           168412     952462.0
3           168336     969380.0
4           168336     958416.0
..             ...          ...
139         166405     902510.0
140         166405     882690.0
141         166405     882688.0
142         166392     891900.0
143         166377     880022.0

[144 rows x 2 columns]

Research 2: long waiting for a response on incoming external calls.
Found 90 callcenters, having 144 "anoumalous" operators.

3. Symptom: few outgoing calls.¶

For each call center (user_id), we build our own distribution, filtering by direction (out), internal (False), ranking operators by the number of calls (calls_count).

Remark. Why didn't I filter the call centers by specialization, despite the task:

  • if the main direction is receiving calls, this does not mean that there cannot be a group of operators who have a task to make outgoing calls.
  • the algorithm does not care what the specialization of the call center is, the study takes place at the operator level for each individual callcenter.
In [61]:
name = 'Research 3: few outgoing calls' ###### change each time
    
# collect anomalies in the left: l=True, collect anomalies in the right: r=True
l = True   ###### collect inefficient ones from below, change each time
r = False  ###### collect inefficient ones from above, change each time

research = pd.DataFrame([],columns=['callcenter_id','operator_id']) # temporary for results

for callcenter in df.user_id.unique():
    option = ((df.direction=='out') ###### change each time
            & (df.is_missed_call==False)  ###### change each time
            & (df.internal==False)  
            & (df.operator_id!=100000.0)
            & (df.user_id==callcenter) 
              )
    xx=df[option].groupby(['date', 'operator_id'],as_index=False)[['calls_count', 'call_duration', 'total_call_duration']].sum()
    xx['research'] = xx['calls_count'] ###### making a research column
    
    dlt = k*np.std(xx.research) # deviation is calculated over the entire time interval from the dataframe
    
    anomalies = find_uneff(l, r, dlt, xx) # getting a list of abnormal operator IDs
    
    if anomalies != None:
        for i in range(len(anomalies)):
            research.loc[-1] = [callcenter, anomalies[i]]  # add a row
            research.index = research.index + 1  # shift the index
            research = research.sort_index()  # sort by index

research.callcenter_id = research.callcenter_id.astype(int)
print('Callcenters, anomalies and operator IDs with their "abnormal" results.')

research_03 = research.copy() ###### creating a copy of the dataframe with the results
print('\n', research_03) 
research_03.name = name ###### assign a name to the dataframe
report(research_03) ###### print out the short result 
Callcenters, anomalies and operator IDs with their "abnormal" results.

      callcenter_id  operator_id
0           168601     952916.0
1           168412     953460.0
2           168361     965538.0
3           168361     945314.0
4           168361     945274.0
..             ...          ...
145         166405     882688.0
146         166405     882684.0
147         166377     881278.0
148         166377     880022.0
149         166377     880020.0

[150 rows x 2 columns]

Research 3: few outgoing calls.
Found 67 callcenters, having 150 "anoumalous" operators.

4. Symptom: conversations are too short or too long for incoming external calls.¶

For each callcenter (user_id), we build our own distribution, filtering by direction (in), internal (False), ranking operators by the average duration of calls (call_duration), which is obtained by dividing the sum of time by the sum of the number of calls.

In [62]:
name = 'Research 4: conversations are too short or too long for incoming external calls' ###### change each time
    
# collect anomalies in the left: l=True, collect anomalies in the right: r=True
l = True  ###### collect inefficient ones from below, change each time
r = True  ###### collect inefficient ones from above, change each time

research = pd.DataFrame([],columns=['callcenter_id','operator_id']) # temporary for results

for callcenter in df.user_id.unique():
    option = ((df.direction=='in') ###### change each time
            & (df.is_missed_call==False)  ###### change each time
            & (df.internal==False)  
            & (df.operator_id!=100000.0)
            & (df.user_id==callcenter) 
              )
    xx=df[option].groupby(['date', 'operator_id'],as_index=False)[['calls_count', 'call_duration', 'total_call_duration']].sum()
    xx['research'] = xx['call_duration'] / xx['calls_count'] ###### aking a research column
    
    dlt = k*np.std(xx.research) # deviation is calculated over the entire time interval from the dataframe
    
    anomalies = find_uneff(l, r, dlt, xx) # getting a list of abnormal operator IDs
    
    if anomalies != None:
        for i in range(len(anomalies)):
            research.loc[-1] = [callcenter, anomalies[i]]  # add a row
            research.index = research.index + 1  # shift the index
            research = research.sort_index()  # sort by index

research.callcenter_id = research.callcenter_id.astype(int)
print('Callcenters, anomalies and operator IDs with their "abnormal" results.')

research_04 = research.copy() ###### creating a copy of the dataframe with the results
print('\n', research_04.tail(10)) 
research_04.name = name ###### assign a name to the dataframe
report(research_04) ###### print out the short result 
Callcenters, anomalies and operator IDs with their "abnormal" results.

     callcenter_id  operator_id
34         166916     906416.0
35         166916     906394.0
36         166899     894662.0
37         166604     893400.0
38         166536     900192.0
39         166536     884524.0
40         166520     923254.0
41         166407     888538.0
42         166405     939478.0
43         166392     886892.0

Research 4: conversations are too short or too long for incoming external calls.
Found 31 callcenters, having 44 "anoumalous" operators.

5. Symptom: conversations are too short or too long for outgoing external calls.¶

For each call center (user_id), we build our own distribution, filtering by direction (out), internal (False), ranking operators by the average duration of calls (call_duration), which is obtained by dividing the sum of time by the sum of the number of calls.

In [63]:
name = 'Research 5: conversations are too short or too long for outgoing external calls' ###### change each time
    
# collect anomalies in the left: l=True, collect anomalies in the right: r=True
l = True  ###### collect inefficient ones from below, change each time
r = True  ###### collect inefficient ones from above, change each time

research = pd.DataFrame([],columns=['callcenter_id','operator_id']) # temporary for results

for callcenter in df.user_id.unique():
    option = ((df.direction=='out') ###### change each time
            & (df.is_missed_call==False)  ###### change each time
            & (df.internal==False)  
            & (df.operator_id!=100000.0)
            & (df.user_id==callcenter) 
              )
    xx=df[option].groupby(['date', 'operator_id'],as_index=False)[['calls_count', 'call_duration', 'total_call_duration']].sum()
    xx['research'] = xx['call_duration'] / xx['calls_count'] ###### making a research column
    
    dlt = k*np.std(xx.research) # deviation is calculated over the entire time interval from the dataframe
    
    anomalies = find_uneff(l, r, dlt, xx) # getting a list of abnormal operator IDs
    
    if anomalies != None:
        for i in range(len(anomalies)):
            research.loc[-1] = [callcenter, anomalies[i]]  # add a row
            research.index = research.index + 1  # shift the index
            research = research.sort_index()  # sort by index

research.callcenter_id = research.callcenter_id.astype(int)
print('Callcenters, anomalies and operator IDs with their "abnormal" results.')

research_05 = research.copy() ###### creating a copy of the dataframe with the results
print('\n', research_05.tail(10)) 
research_05.name = name ###### assign a name to the dataframe
report(research_05) ###### print out the short result
Callcenters, anomalies and operator IDs with their "abnormal" results.

     callcenter_id  operator_id
46         166800     892534.0
47         166680     950444.0
48         166658     891154.0
49         166520     969600.0
50         166520     910926.0
51         166520     910902.0
52         166520     909452.0
53         166520     891192.0
54         166520     890232.0
55         166485     887280.0

Research 5: conversations are too short or too long for outgoing external calls.
Found 37 callcenters, having 56 "anoumalous" operators.

Remark:

So far, only the IDs of "abnormal" operators are included in the output data. In the commercial version, the output dataframe should have more complete data that would help call centers understand why each operator got into a set of inefficient ones or vice versa.

In [64]:
c = len(set(research_01.callcenter_id) | set(research_02.callcenter_id) | 
  set(research_03.callcenter_id) | set(research_04.callcenter_id)|set(research_05.callcenter_id))
cp = round(100 * c / df.user_id.nunique())
o = len(set(research_01.operator_id) | set(research_02.operator_id) | 
  set(research_03.operator_id) | set(research_04.operator_id)|set(research_05.operator_id))
op = round(100 * o / df.operator_id.nunique())
print(f'In five studies, including intersections, found {c}({cp}%) callcenters, having {o}({op}%) "abnormal" operators.')
In five studies, including intersections, found 117(38%) callcenters, having 314(29%) "abnormal" operators.

Conclusion to 3:

  • Research 1: A lot of missed incoming calls. 6 callcenters were found, which have 7 "abnormal" operators.
  • Research 2: Long waiting for a response on incoming external calls. 90 callcenters were found with 144 "abnormal" operators.
  • Research 3: Too few outgoing calls. 67 callcenters were found with 150 "abnormal" operators.
  • Research 4: Too short or too long conversations with incoming outdoor calls. 31 callcenters were found with 44 "abnormal" operators.
  • Research 5: Conversations that are too short or too long for outgoing outdoor calls. 37 callcenters were found with 56 "abnormal" operators.
  • In five studies, including intersections, 117 (38%) callcenters were found, with 314 (29%) "abnormal" operators.

Formulation and verification of statistical hypotheses.¶

In the last 4 weeks, the number of call centers active every day has fluctuated approximately the same. On weekends, dips are visible, the maximums are approximately at the same level (see clause 1.3).

Let's select 2 intervals - each for 2 weeks, from Monday to Sunday through Sunday:

In [65]:
sample_1 = df.loc[(df.date >= '2019.10.27') & (df.date < '2019.11.10')].sort_values(by='date')
sample_2 = df.loc[(df.date >= '2019.11.10') & (df.date < '2019.11.24')].sort_values(by='date')
In [66]:
# demonstrate this on the graph
rr = df.groupby('date', as_index=False)['user_id'].nunique() # number of callcenters per day
rr.plot(kind='line', x='date', y='user_id', figsize=(5, 3))
plt.axvline('2019.11.10', color='red', alpha = 0.5)
plt.axvline('2019.11.24', color='red', alpha = 0.5)
plt.axvline('2019.10.27', color='red', alpha = 0.5)
plt.xticks(rotation = 0)
plt.title('Number of callcenters per dates')
plt.ylabel('Number of callcenters')
plt.xlabel('Dates')
plt.legend([],[], frameon=False)
plt.show();
In [67]:
# creating 2 series with the numbers of callcenters in these time periods
r1 = sample_1.groupby('date', as_index=False)['user_id'].nunique().user_id
r2 = sample_2.groupby('date', as_index=False)['user_id'].nunique().user_id

Test 1. Mann-Whitney.¶

  • Null hypothesis H0: 2-week samples of the number of callcenters are the same
  • Alternative H1: 2-week samples of the number of callcenters are NOT the same
In [68]:
# compare groups 1 and 2 by Mann-Whitney 

alpha = 0.05
results = st.mannwhitneyu(r1, r2, True, 'less')
print('Wilcoxon-Mann-Whitney test p-value: ', results.pvalue)
if results.pvalue < alpha:
    print('Rejecting the null hypothesis: the difference is statistically significant.')
else:
    print('It was not possible to reject the null hypothesis, the difference is NOT statistically significant.')
Wilcoxon-Mann-Whitney test p-value:  0.4090863915340001
It was not possible to reject the null hypothesis, the difference is NOT statistically significant.

Test 2. T-test for equality of averages.¶

  • Null hypothesis H0: the average values of the number of callcenters in 2-week samples are the same
  • Alternative H1: the average values of the number of callcenters in 2-week samples are NOT the same
In [69]:
# the graphs are similar, Mann-Whitney also considers the sequences similar - the variances should be quite close

# results = calling a method to test a hypothesis
results = st.ttest_ind(r1, r2, equal_var=True) # consider the variances to be close
# alpha = significance level value
alpha = 0.05
# printing p-value
print(results.pvalue)
# conditional operator with the output of a response string
if results.pvalue < alpha:
    print('Rejecting the null hypothesis.')
else:
    print('Not possible to reject the null hypothesis.')
0.5852169029492182
Not possible to reject the null hypothesis.

Test 3. Z-test for the equality of proportions.¶

  • Null hypothesis H0: the proportions of missed calls in 2-week samples are the same
  • Alternative H1: the proportions of missed calls in 2-week samples are NOT the same
In [70]:
alpha = 0.05 # critical level of statistical significance

successes = np.array([sample_1.is_missed_call.sum(), sample_2.is_missed_call.sum()]) # missed calls
trials = np.array([sample_1.is_missed_call.count(), sample_2.is_missed_call.count()]) # calls in total

p1 = successes[0]/trials[0] # proportions of success in the first group
p2 = successes[1]/trials[1] # proportions of success in the second group

# proportions of success in the combined dataset:
p_combined = (successes[0] + successes[1]) / (trials[0] + trials[1])

difference = p1 - p2 # difference in proportions in datasets

# statistics in the standard deviations of the standard normal distribution
z_value = difference / mth.sqrt(p_combined * (1 - p_combined) * (1/trials[0] + 1/trials[1]))

# setting the standard normal distribution (mean 0, standard deviation 1)
distr = st.norm(0, 1) 

p_value = (1 - distr.cdf(abs(z_value))) * 2
print('p-value: ', p_value)

if p_value < alpha:
    print('Rejecting the null hypothesis: there is a significant difference between the shares.')
else:
    print('Not possible to reject the null hypothesis, there is no reason to consider the shares different.')
p-value:  0.01777018753871218
Rejecting the null hypothesis: there is a significant difference between the shares.

Conclusion to 4:

  • Fluctuations in the number of operators in the 1st and 2nd two-week intervals according to Mann-Whitney test are statistically the same.
  • The average number of operators in the 1st and 2nd two-week intervals according to T-test are statistically equal.
  • Nevertheless, according to Z-test, the ratio of lost calls to the total number of calls in these two periods is significantly different.

General conclusions:¶

Analysis of inefficient operators:

  • Research 1: A lot of missed incoming calls. 6 callcenters were found, which have 7 "abnormal" operators.
  • Research 2: Long waiting for a response on incoming external calls. 90 callcenters were found with 144 "abnormal" operators.
  • Research 3: Too few outgoing calls. 67 callcenters were found with 150 "abnormal" operators.
  • Research 4: Too short or too long conversations with incoming outdoor calls. 31 callcenters were found with 44 "abnormal" operators.
  • Research 5: Conversations that are too short or too long for outgoing outdoor calls. 37 callcenters were found with 56 "abnormal" operators.
  • In the five studies, including intersections, 117 (38%) callcenters were found, with 314 (29%) "abnormal" operators.

Statistical research:

  • Fluctuations in the number of operators in the 1st and 2nd two-week intervals according to Mann-Whitney test are statistically the same.
  • The average number of operators in the 1st and 2nd two-week intervals according to T-test are statistically equal.
  • Nevertheless, according to Z-test, the ratio of lost calls to the total number of calls in these two periods is significantly different.

Recommendations for the client:

  • correct the export of operator IDs to integer format instead of float.
  • double-check the correctness of the classification of calls as missed.
  • if possible, try to link each missed call with the ID of a specific operator (possibly a group of operators). Ideally, each missed call should have its "owner".
  • increase the amount of information collected, including disabling data aggregation. The more incoming information, the more "intelligent" metrics can be.
  • depending on the new data that has appeared, increase the number of tests for operator inefficiency (in this work, 5 tests have been done, and as many as possible, for example, client might add the ratio of missed calls to accepted ones for each operator).
  • client can add reporting on the work of the callcenter as a whole, for example, the ratio of missed calls to accepted calls for the callcenter or the average call time for each callcenter.
  • if there is a system which finds inefficient operators, then why not to create a similar one that will look for the most effective ones? Then callcenters could use data from the two systems to organize transfer of best practice to lagging operators (internal training), as well as to encourage their more efficient employees.
In [ ]: